NSF PAR Search | NSF Public Access Repository

SPLAT: A Framework for Optimised GPU Code-Generation for SParse reguLar ATtention

https://doi.org/10.1145/3720503

Gupta, Ahan; Yuan, Yueming; Jain, Devansh; Ge, Yuhao; Aponte, David; Zhou, Yanqi; Mendis, Charith (April 2025, Proceedings of the ACM on Programming Languages)

Multi-head-self-attention (MHSA) mechanisms achieve state-of-the-art (SOTA) performance across natural language processing and vision tasks. However, their quadratic dependence on sequence lengths has bottlenecked inference speeds. To circumvent this bottleneck, researchers have proposed various sparse-MHSA models, where a subset of full attention is computed. Despite their promise, current sparse libraries and compilers do not support high-performance implementations for diverse sparse-MHSA patterns due to the underlying sparse formats they operate on. On one end, sparse libraries operate on general sparse formats, which target extreme amounts of random sparsity (<10% non-zero values) and have high metadata in O(nnzs). On the other end, hand-written kernels operate on custom sparse formats, which target specific sparse-MHSA patterns. However, the sparsity patterns in sparse-MHSA are moderately sparse (10-50% non-zero values) and varied, resulting in general sparse formats incurring high metadata overhead and custom sparse formats covering few sparse-MHSA patterns, trading off generality for performance. We bridge this gap, achieving both generality and performance, by proposing a novel sparse format, affine-compressed-sparse-row (ACSR), and a supporting code-generation scheme, SPLAT, that generates high-performance implementations for diverse sparse-MHSA patterns on GPUs. Core to our proposed format and code generation algorithm is the observation that common sparse-MHSA patterns have uniquely regular geometric properties. These properties, which can be analyzed just-in-time, expose novel optimizations and tiling strategies that SPLAT exploits to generate high-performance implementations for diverse patterns. To demonstrate SPLAT’s efficacy, we use it to generate code for various sparse-MHSA models, achieving speedups of up to 2.05x and 4.05x over hand-written kernels written in Triton and TVM respectively on A100 GPUs in single-precision.
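The metadata argument in the abstract above can be illustrated with a small sketch. It is not the ACSR format itself, whose layout the abstract does not give. It only assumes a sliding-window attention mask as an example of a regular pattern, and compares CSR-style column indices (one per non-zero) with a hypothetical per-row `(start, stride, count)` description. All function names are made up for illustration.

```python
def window_pattern(n, w):
    """Column indices of a sliding-window mask: row i attends to j with |i - j| <= w."""
    return [[j for j in range(n) if abs(i - j) <= w] for i in range(n)]


def csr_metadata(rows):
    """CSR-style metadata: one column index per non-zero, plus n + 1 row offsets."""
    col_idx = [j for row in rows for j in row]
    row_ptr = [0]
    for row in rows:
        row_ptr.append(row_ptr[-1] + len(row))
    return col_idx, row_ptr


def affine_metadata(rows):
    """Hypothetical per-row (start, stride, count) triples.

    Assumes every row is non-empty and its columns form an arithmetic
    progression; raises ValueError otherwise.
    """
    meta = []
    for row in rows:
        stride = row[1] - row[0] if len(row) > 1 else 1
        if any(b - a != stride for a, b in zip(row, row[1:])):
            raise ValueError("row is not an arithmetic progression")
        meta.append((row[0], stride, len(row)))
    return meta


def expand(meta):
    """Recover the column indices from the per-row triples."""
    return [[start + k * stride for k in range(count)]
            for start, stride, count in meta]


rows = window_pattern(8, 1)
col_idx, row_ptr = csr_metadata(rows)
meta = affine_metadata(rows)
print(len(col_idx), "column indices vs", len(meta), "per-row triples")
```

In this toy case the index array grows with the number of non-zeros, while the per-row description grows only with the number of rows, which is the kind of regularity the abstract says common sparse-MHSA patterns exhibit.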
Free, publicly-accessible full text available April 9, 2026
Large graph property prediction via graph segment training

Cao, Kaidi; Phothilimthana, Phitchaya Mangpo; Abu-El-Haija, Sami; Zelle, Dustin; Zhou, Yanqi; Mendis, Charith; Leskovec, Jure; Perozzi, Bryan (May 2024, International Conference on Neural Information Processing Systems)

Full Text Available
Large Graph Property Prediction via Graph Segment Training

Cao, Kaidi; Phothilimthana, Phitchaya Mangpo; Abu-El-Haija, Sami; Zelle, Dustin; Zhou, Yanqi; Mendis, Charith; Leskovec, Jure; Perozzi, Bryan (December 2023, Advances in neural information processing systems)

Learning to predict properties of a large graph is challenging because each prediction requires the knowledge of an entire graph, while the amount of memory available during training is bounded. Here we propose Graph Segment Training (GST), a general framework that utilizes a divide-and-conquer approach to allow learning large graph property prediction with a constant memory footprint. GST first divides a large graph into segments and then backpropagates through only a few segments sampled per training iteration. We refine the GST paradigm by introducing a historical embedding table to efficiently obtain embeddings for segments not sampled for backpropagation. To mitigate the staleness of historical embeddings, we design two novel techniques. First, we finetune the prediction head to fix the input distribution shift. Second, we introduce Stale Embedding Dropout to drop some stale embeddings during training to reduce bias. We evaluate our complete method GST+EFD (with all the techniques together) on two large graph property prediction benchmarks: MalNet and TpuGraphs. Our experiments show that GST+EFD is both memory-efficient and fast, while offering a slight boost on test accuracy over a typical full graph training regime.
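The forward pass described in the abstract above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: embeddings are plain floats, segments are pooled by a simple mean, the historical table is a dictionary, and the names `gst_forward`, `encode`, `k` and `drop_p` are hypothetical. Prediction-head finetuning is not shown.

```python
import random


def gst_forward(segments, table, encode, k, drop_p, rng):
    """One GST-style forward pass over the segments of a single graph.

    k segments are sampled and encoded afresh (in a real framework only
    these would receive gradients). The remaining segments reuse entries
    from the historical table, and each stale entry is dropped with
    probability drop_p. Returns (pooled_embedding, sampled_ids).
    """
    sampled = set(rng.sample(range(len(segments)), k))
    kept = []
    for i, seg in enumerate(segments):
        if i in sampled:
            emb = encode(seg)
            table[i] = emb              # refresh the historical table
            kept.append(emb)
        elif i in table and rng.random() >= drop_p:
            kept.append(table[i])       # stale embedding, no backpropagation
    return sum(kept) / len(kept), sampled


segments = [[1, 2], [3], [4, 5, 6]]     # toy "segments" of node features
encode = lambda seg: float(sum(seg))    # stand-in for a GNN encoder
table = {}
rng = random.Random(0)

# Warm-up pass that encodes every segment and fills the table.
full, _ = gst_forward(segments, table, encode, k=3, drop_p=0.0, rng=rng)
# Later pass: one fresh segment, the others come from the table.
mixed, sampled = gst_forward(segments, table, encode, k=1, drop_p=0.0, rng=rng)
print(full, mixed, sorted(sampled))
```

Because only `k` segments are encoded per pass, the work that needs gradients stays bounded regardless of how many segments the graph has, which is the constant-memory property the abstract describes.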
Full Text Available
GiPH: Generalizable Placement Learning for Adaptive Heterogeneous Computing

Hu, Yi; Zhang, Chaoran; Andert, Edward; Singh, Harshul; Shrivastava, Aviral; Laudon, James; Zhou, Yanqi; Iannucci, Bob; Joe-Wong, Carlee (May 2023, MLSys)

Careful placement of a distributed computational application within a target device cluster is critical for achieving low application completion time. The problem is challenging due to its NP-hardness and combinatorial nature. In recent years, learning-based approaches have been proposed to learn a placement policy that can be applied to unseen applications, motivated by the problem of placing a neural network across cloud servers. These approaches, however, generally assume the device cluster is fixed, which is not the case in mobile or edge computing settings, where heterogeneous devices move in and out of range for a particular application. To address the challenge of scaling to different-sized device clusters and adapting to the addition of new devices, we propose a new learning approach called GiPH, which learns policies that generalize to dynamic device clusters via 1) a novel graph representation gpNet that efficiently encodes the information needed for choosing a good placement, and 2) a scalable graph neural network (GNN) that learns a summary of the gpNet information. GiPH turns the placement problem into that of finding a sequence of placement improvements, learning a policy for selecting this sequence that scales to problems of arbitrary size. We evaluate GiPH with a wide range of task graphs and device clusters and show that our learned policy rapidly finds good placements for new problem instances. GiPH finds placements that achieve up to 30.5% better makespan, searching up to 3× faster than other search-based placement policies.
GiPH: Generalizable Placement Learning for Adaptive Heterogeneous Computing

Hu, Yi; Zhang, Chaoran; Andert, Edward; Singh, Harshul; Shrivastava, Aviral; Laudon, James; Zhou, Yanqi; Iannucci, Bob; Joe-Wong, Carlee (January 2023, 6th Conference on Machine Learning and Systems)

Full Text Available
Reinforcement Learning Empowered MLaaS Scheduling for Serving Intelligent Internet of Things

https://doi.org/10.1109/JIOT.2020.2965103

Qin, Heyang; Zawad, Syed; Zhou, Yanqi; Padhi, Sanjay; Yang, Lei; Yan, Feng (January 2020, IEEE Internet of Things Journal)

Full Text Available
Swift machine learning model serving scheduling: a region based reinforcement learning approach

https://doi.org/10.1145/3295500.3356164

Qin, Heyang; Zawad, Syed; Zhou, Yanqi; Yang, Lei; Zhao, Dongfang; Yan, Feng (November 2019, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'19))

The success of machine learning has driven the growth of Machine-Learning-as-a-Service (MLaaS): deploying trained machine learning (ML) models in the cloud to provide low-latency inference services at scale. To meet a latency Service-Level Objective (SLO), judicious parallelization at both the request and operation levels is critical. However, existing ML systems (e.g., TensorFlow) and cloud ML serving platforms (e.g., SageMaker) are SLO-agnostic and rely on users to manually configure the parallelism. To provide low-latency ML serving, this paper proposes a swift machine learning serving scheduling framework with a novel Region-based Reinforcement Learning (RRL) approach. RRL can efficiently identify the optimal parallelism configuration under different workloads by estimating the performance of similar configurations from that of known ones. We show both theoretically and experimentally that the RRL approach can outperform state-of-the-art approaches by finding near-optimal solutions over 8 times faster while reducing inference latency by up to 79.0% and SLO violations by up to 49.9%.
Full Text Available
EPNAS: Efficient Progressive Neural Architecture Search

Zhou, Yanqi; Wang, Peng; Arik, Sercan; Yu, Haonan; Zawad, Syed; Yan, Feng; Greg, Diamos (September 2019, Proceedings of the 2019 30th British Machine Vision Conference (BMVC 2019))

In this paper, we propose Efficient Progressive Neural Architecture Search (EPNAS), a neural architecture search (NAS) method that efficiently handles a large search space through a novel progressive search policy with performance prediction based on REINFORCE [37]. EPNAS is designed to search target networks in parallel, which is more scalable on parallel systems such as GPU/TPU clusters. More importantly, EPNAS can be generalized to architecture search with multiple resource constraints, e.g., model size, compute complexity or intensity, which is crucial for deployment on widespread platforms such as mobile and cloud. We compare EPNAS against other state-of-the-art (SoTA) network architectures (e.g., MobileNetV2 [39]) and efficient NAS algorithms (e.g., ENAS [34] and PNAS [27]) on image recognition tasks using CIFAR10 and ImageNet. On both datasets, EPNAS is superior w.r.t. architecture searching speed and recognition accuracy.
Full Text Available
OpenPiton: an open source hardware platform for your research

https://doi.org/10.1145/3366343

Balkind, Jonathan; McKeown, Michael; Fu, Yaosheng; Nguyen, Tri; Zhou, Yanqi; Lavrov, Alexey; Shahrad, Mohammad; Fuchs, Adi; Payne, Samuel; Liang, Xiaohua; et al. (November 2019, Communications of the ACM)
Full Text Available
